QTM 447 Lecture 22: Bayesian Machine Learning

Kevin McAlister

April 3, 2025

Bayesian Machine Learning

Bayesian Methods treat the standard machine learning problem in a slightly different way

Let \(\boldsymbol \theta\) be the collection of unknowns that we would like to learn from our training data in a way that minimizes generalization error

Let \(\mathcal D\) be the collection of knowns - data, any fixed tuning parameters that are tuned after the fact or just known, etc.

The posterior distribution over the unknowns can be defined using Bayes theorem:

\[f(\boldsymbol \theta | \mathcal D) = \frac{f(\mathcal D | \boldsymbol \theta)f(\boldsymbol \theta)}{\int \limits_{\boldsymbol \theta} f(\mathcal D | \boldsymbol \theta)f(\boldsymbol \theta) d \boldsymbol \theta}\]

Bayesian Machine Learning

\[f(\boldsymbol \theta | \mathcal D) = \frac{f(\mathcal D | \boldsymbol \theta)f(\boldsymbol \theta)}{\int \limits_{\boldsymbol \theta} f(\mathcal D | \boldsymbol \theta)f(\boldsymbol \theta) d \boldsymbol \theta}\]

Likelihood:

\[f(\mathcal D | \boldsymbol \theta)\]

Given a value of the unknowns, what is the likelihood with which we would see any random knowns?

  • For linear regression:

\[f(\mathbf y | \mathbf X , \boldsymbol \beta , \sigma^2) \sim \prod \limits_{i = 1}^N \mathcal N(y_i | \mathbf x_i^T \boldsymbol \beta , \sigma^2) = \mathcal N_N(\mathbf y | \mathbf X \boldsymbol \beta, \sigma^2 \mathcal I_N)\]

Bayesian Machine Learning

\[f(\boldsymbol \theta | \mathcal D) = \frac{f(\mathcal D | \boldsymbol \theta)f(\boldsymbol \theta)}{\int \limits_{\boldsymbol \theta} f(\mathcal D | \boldsymbol \theta)f(\boldsymbol \theta) d \boldsymbol \theta}\]

Prior:

\[f(\boldsymbol \theta)\]

  • Prior to seeing any data, what are our beliefs/desires for the unknowns?

  • Chosen for convenience/purpose.

  • For linear regression (with conjugate prior):

\[f(\boldsymbol \beta) \sim \mathcal N_P(\boldsymbol \beta | \boldsymbol \mu_0 , \boldsymbol \Sigma_0)\]

Bayesian Machine Learning

Conditioning the prior beliefs on the data, we get the posterior distribution:

\[f(\boldsymbol \theta | \mathcal D)\]

  • A proper PDF over the unknowns

  • A probability distribution over the unknown quantities

  • Can define a most likely value, intervals, etc.

Bayesian Methods

Before jumping into regression, let's look at a simpler example that demonstrates the Bayesian estimation process.

Example: Normal likelihood, known variance

Suppose that we observe \(N\) instances of \(y\) that we believe are i.i.d. draws from a normal distribution with a known value for the variance. \(\mu\) - the mean of the normal - is unknown.

Likelihood:

\[f(\mathbf y | \mu , \sigma^2) = \prod \limits_{i = 1}^N \frac{1}{\sqrt{2 \pi \sigma^2}} \exp \left[ - \frac{1}{2 \sigma^2} (y_i - \mu)^2 \right]\]

Bayesian Methods

Our prior distribution over the unknown, \(\mu\), will be a normal distribution:

\[f(\mu) \sim \mathcal N(\mu | \mu_0 , \sigma^2_0)\]

with some arbitrary mean and variance.

Bayesian Methods

We can define our posterior distribution as:

\[f(\mu | \mathbf y) = \frac{\prod \limits_{i = 1}^N \mathcal N(y_i | \mu , \sigma^2)\mathcal N(\mu | \mu_0 , \sigma^2_0)}{\int \limits_{-\infty}^{\infty} \prod \limits_{i = 1}^N \mathcal N(y_i | \mu , \sigma^2)\mathcal N(\mu | \mu_0 , \sigma^2_0) d\mu}\]

This is our posterior distribution!

  • It’s just not in a useful form

  • What is the mode of the posterior?

  • What is the variance of the posterior?

Bayesian Methods

How do we make posteriors useful?

Four strategies:

  • Analytically find an easy-to-work-with form for the posterior

  • Find the posterior mode and apply a Laplace approximation

  • Take a large number of random draws from the posterior (Not covered in this class, see Gibbs and Metropolis approaches)

  • Find an approximation that maximizes the marginal likelihood

Bayesian Methods

Sometimes, posterior distributions can be analytically simplified

  • Generally only the case when our likelihood and prior are of a certain form

Let’s show an example of this.

Bayesian Methods

\[f(\mu | \mathbf y) = \frac{\prod \limits_{i = 1}^N \mathcal N(y_i | \mu , \sigma^2)\mathcal N(\mu | \mu_0 , \sigma^2_0)}{\int \limits_{-\infty}^{\infty} \prod \limits_{i = 1}^N \mathcal N(y_i | \mu , \sigma^2)\mathcal N(\mu | \mu_0 , \sigma^2_0) d\mu}\]

First simplification:

The denominator is a constant that is not random!

  • We’re integrating over \(\mu\) and everything else is known!

  • Its purpose is to ensure that the posterior integrates to 1.

Bayesian Methods

\[f(\mu | \mathbf y) = \frac{1}{Z} \prod \limits_{i = 1}^N \mathcal N(y_i | \mu , \sigma^2)\mathcal N(\mu | \mu_0 , \sigma^2_0)\]

  • Now, let’s write out the equation for the normals to see if there’s any simplifications we can make

\[f(\mu | \mathbf y) = \frac{1}{Z} \prod \limits_{i = 1}^N \frac{1}{\sqrt{2 \pi \sigma^2}} \exp \left[ - \frac{1}{2 \sigma^2} (y_i - \mu)^2 \right] \frac{1}{\sqrt{2 \pi \sigma_0^2}} \exp \left[ - \frac{1}{2 \sigma_0^2} (\mu - \mu_0)^2 \right]\]

Simplification #2:

We only care about the random terms

  • i.e. terms that have to do with \(\mu\)

  • Anything else can be moved to the front constant

Bayesian Methods

\[f(\mu | \mathbf y) = C \prod \limits_{i = 1}^N \exp \left[ - \frac{1}{2 \sigma^2} (y_i - \mu)^2 \right] \exp \left[ - \frac{1}{2 \sigma_0^2} (\mu - \mu_0)^2 \right]\]

A product of exponentials is the exponential of the sum of the exponents:

\[f(\mu | \mathbf y) = C \exp \left[ - \frac{1}{2 \sigma^2} \sum \limits_{i = 1}^N (y_i - \mu)^2 \right] \exp \left[ - \frac{1}{2 \sigma_0^2} (\mu - \mu_0)^2 \right]\]

Applying the same idea to fold the prior into a single exponent:

\[f(\mu | \mathbf y) = C \exp \left[ - \frac{1}{2 \sigma^2} \sum \limits_{i = 1}^N (y_i - \mu)^2 - \frac{1}{2 \sigma_0^2} (\mu - \mu_0)^2 \right]\]

Bayesian Methods

Expanding the squares and distributing the sum:

\[f(\mu | \mathbf y) \propto \exp \left[ - \frac{1}{2 \sigma^2} \left(\sum y_i^2 - 2 \mu \sum y_i + N \mu^2 \right) - \frac{1}{2 \sigma_0^2} \left(\mu^2 - 2 \mu \mu_0 + \mu_0^2 \right) \right]\]

Combining like terms:

\[f(\mu | \mathbf y) \propto \exp \left[ - \frac{1}{2} \left(\frac{1}{\sigma^2} \sum y_i^2 + \frac{1}{\sigma^2_0} \mu_0^2 - 2 \mu \left(\frac{1}{\sigma^2}\sum y_i + \frac{1}{\sigma^2_0} \mu_0 \right) + \mu^2 \left(\frac{N}{\sigma^2} + \frac{1}{\sigma^2_0} \right) \right) \right]\]

Now, take advantage of the fact that we can get rid of any multiplicative terms that don’t have to do with \(\mu\)

Bayesian Methods

\[f(\mu | \mathbf y) = C' \exp \left[ - \frac{1}{2} \left(- 2 \mu \left(\frac{1}{\sigma^2}\sum y_i + \frac{1}{\sigma^2_0} \mu_0 \right) + \mu^2 \left(\frac{N}{\sigma^2} + \frac{1}{\sigma^2_0} \right) \right) \right]\]

This looks a lot like the kernel of a normal distribution:

\[C \exp\left[-\frac{1}{2 s^2}(z - m)^2 \right] = C' \exp\left[-\frac{1}{2 s^2}(z^2 - 2 z m) \right]\]

Can we put this in the form of the Gaussian kernel?

Bayesian Methods

The Gaussian identity for random variable \(y\): \[C' \exp\left[-\frac{1}{2}(ay^2 - 2 b y) \right]\]

is equivalent to a normal distribution (up to a multiplicative constant) where:

\[\hat{\sigma}^2 = a^{-1}\]

\[\hat{\mu} = a^{-1}b\]
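We can sanity-check this identity numerically: normalize the kernel on a grid and compare it to the implied normal density (the constants \(a\) and \(b\) below are arbitrary illustrative values).

```python
import numpy as np
from scipy.stats import norm

# Arbitrary illustrative kernel constants.
a, b = 2.5, 1.3
mu_hat = b / a            # completed-square mean
var_hat = 1.0 / a         # completed-square variance

# Normalize exp(-(1/2)(a y^2 - 2 b y)) numerically on a wide grid.
ys = np.linspace(-4.0, 5.0, 2001)
kernel = np.exp(-0.5 * (a * ys**2 - 2 * b * ys))
kernel /= kernel.sum() * (ys[1] - ys[0])

# It matches the N(b/a, 1/a) density.
dens = norm.pdf(ys, loc=mu_hat, scale=np.sqrt(var_hat))
assert np.allclose(kernel, dens, atol=1e-4)
```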

Bayesian Methods

This means that our posterior is a normal distribution:

\[f(\mu | \mathbf y) \sim \mathcal N(\mu | \hat{\mu} , \hat{\sigma}^2)\]

where

\[\hat{\sigma}^2 = \left( \frac{N}{\sigma^2} + \frac{1}{\sigma^2_0} \right)^{-1}\]

and

\[\hat{\mu} = \left( \frac{N}{\sigma^2} + \frac{1}{\sigma^2_0} \right)^{-1}\left(\frac{1}{\sigma^2}\sum y_i + \frac{1}{\sigma^2_0} \mu_0 \right)\]
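These update formulas are simple enough to sketch in a few lines. A minimal demonstration on simulated data (all numeric values are illustrative), showing shrinkage toward the prior mean under a tight prior and recovery of the sample average under a diffuse prior:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2 = 4.0                      # known likelihood variance
y = rng.normal(loc=2.0, scale=np.sqrt(sigma2), size=50)

def posterior(y, sigma2, mu0, sigma2_0):
    """Posterior mean and variance for a normal mean with known variance."""
    n = len(y)
    var_hat = 1.0 / (n / sigma2 + 1.0 / sigma2_0)
    mu_hat = var_hat * (y.sum() / sigma2 + mu0 / sigma2_0)
    return mu_hat, var_hat

# A tight prior at 0 pulls the estimate toward 0; a diffuse prior recovers ybar.
mu_tight, _ = posterior(y, sigma2, mu0=0.0, sigma2_0=0.01)
mu_diffuse, var_diffuse = posterior(y, sigma2, mu0=0.0, sigma2_0=1e12)
assert abs(mu_tight) < abs(y.mean())
assert np.isclose(mu_diffuse, y.mean())
assert np.isclose(var_diffuse, sigma2 / len(y))
```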

Bayesian Methods

That’s convenient that we ended up with a normal posterior distribution!

It’s not a coincidence

Conjugate Prior

For certain likelihood and prior pairs, the posterior takes the same form as the prior

  • Normal/Normal

  • Beta/Bernoulli

  • Multivariate Normal/Multivariate Normal

Prior form is chosen out of convenience!

Bayesian Methods

Interesting observation:

\[\hat{\mu} = \left( \frac{N}{\sigma^2} + \frac{1}{\sigma^2_0} \right)^{-1}\left(\frac{1}{\sigma^2}\sum y_i + \frac{1}{\sigma^2_0} \mu_0 \right)\]

  • If the prior variance is small (meaning a really precise prior or we have a strong belief about the location of \(\mu\)), the posterior mean will be really close to \(\mu_0\)

  • The numerator will be dominated by the value of the prior mean

Bayesian Methods

Interesting observation:

What happens if we have a prior with infinite variance?

  • All values of \(\mu\) between \(-\infty\) and \(\infty\) are equally likely a priori

\[\hat{\sigma}^2 = \left( \frac{N}{\sigma^2} + \frac{1}{\sigma^2_0} \right)^{-1} = \frac{\sigma^2}{N}\]

and

\[\hat{\mu} = \left( \frac{N}{\sigma^2} + \frac{1}{\sigma^2_0} \right)^{-1}\left(\frac{1}{\sigma^2}\sum y_i + \frac{1}{\sigma^2_0} \mu_0 \right) = \frac{\sigma^2}{N} \frac{\sum y_i}{\sigma^2} = \frac{\sum y_i}{N}\]

The moments of the posterior for the mean become the maximum likelihood estimates: the sample average and the squared standard error.


Bayesian Methods

We can think of MLE as a special case of Bayesian methods where we have diffuse priors

  • We don’t have any real prior beliefs.

Another interesting result:

As \(N \to \infty\) with a non-infinite variance prior

\[\hat{\sigma}^2 \approx \frac{\sigma^2}{N}\]

\[\hat{\mu} \approx \frac{\sum y_i}{N}\]

  • At a certain point, our prior doesn’t matter!

  • We can see MLE as the asymptotic limit of the posterior distribution as \(N\) gets large


Bayesian Linear Regression

How is this useful for us?

Let’s now look at the linear regression problem. We want to find the posterior over the regression coefficients given the data we observe and assuming a known variance:

\[f(\boldsymbol \beta | \mathbf X , \mathbf y, \sigma^2)\]

The likelihood:

\[f(\mathbf y | \mathbf X , \boldsymbol \beta , \sigma^2) = \prod \limits_{i = 1}^N \mathcal N(y_i | \mathbf x_i^T \boldsymbol \beta , \sigma^2)\]

Because this is a product of independent normals, we can represent this via an \(N\)-dimensional multivariate normal distribution:

\[f(\mathbf y | \mathbf X , \boldsymbol \beta , \sigma^2) \sim \mathcal N_N(\mathbf y | \mathbf X \boldsymbol \beta , \sigma^2 \mathcal I_N)\]

Bayesian Linear Regression

The prior:

Let’s say that the prior over the \(P\) coefficients follows a multivariate normal distribution with mean \(\boldsymbol \mu_0\) and covariance matrix \(\boldsymbol \Sigma_0\)

\[f(\boldsymbol \beta) \sim \mathcal N_P(\boldsymbol \beta | \boldsymbol \mu_0 , \boldsymbol \Sigma_0)\]

This means that our posterior is:

\[f(\boldsymbol \beta | \mathbf X , \mathbf y, \sigma^2) = \frac{1}{Z} \mathcal N_N(\mathbf y | \mathbf X \boldsymbol \beta , \sigma^2 \mathcal I_N) \mathcal N_P(\boldsymbol \beta | \boldsymbol \mu_0 , \boldsymbol \Sigma_0)\]

Bayesian Linear Regression

What we know:

  • A multivariate normal distribution (of any dimensionality) is conjugate to a multivariate normal likelihood!

This means that:

\[f(\boldsymbol \beta | \mathbf X , \mathbf y, \sigma^2) \sim \mathcal N_P(\boldsymbol \beta | \hat{\boldsymbol \mu} , \hat{\boldsymbol \Sigma})\]

Bayesian Linear Regression

The proof for this is quite long, but closely follows the normal/normal proof we did above. It is outlined in all its glory in the lecture notes for Bayesian Regression included on the Canvas site.

The punchline:

\[f(\boldsymbol \beta | \mathbf X , \mathbf y, \sigma^2) \sim \mathcal N_P(\boldsymbol \beta | \hat{\boldsymbol \mu} , \hat{\boldsymbol \Sigma})\]

\[\hat{\boldsymbol \Sigma} = \left(\frac{1}{\sigma^2} \mathbf X^T \mathbf X + \boldsymbol \Sigma_0^{-1} \right)^{-1}\]

\[\hat{\boldsymbol \mu} = \hat{\boldsymbol \Sigma}\left(\frac{1}{\sigma^2} \mathbf X^T \mathbf y + \boldsymbol \Sigma^{-1}_0 \boldsymbol \mu_0 \right)\]
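These two formulas are all we need to compute the full posterior in code. A sketch on simulated data (the design, noise variance, and prior values are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
N, P = 200, 3
X = rng.normal(size=(N, P))
beta_true = np.array([1.0, -2.0, 0.5])
sigma2 = 1.0
y = X @ beta_true + rng.normal(scale=np.sqrt(sigma2), size=N)

# Assumed conjugate prior: mean zero, fairly diffuse covariance 10 * I.
mu0 = np.zeros(P)
Sigma0_inv = np.linalg.inv(10.0 * np.eye(P))

# Posterior covariance and mean from the formulas above.
Sigma_hat = np.linalg.inv(X.T @ X / sigma2 + Sigma0_inv)
mu_hat = Sigma_hat @ (X.T @ y / sigma2 + Sigma0_inv @ mu0)

# With N = 200 and a weak prior, the posterior mean sits near the truth.
assert np.allclose(mu_hat, beta_true, atol=0.35)
```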

Bayesian Linear Regression

Implication 1:

When all of the prior variances are diffuse

\[\hat{\boldsymbol \mu} = \left(\frac{1}{\sigma^2} \mathbf X^T \mathbf X \right)^{-1}\left(\frac{1}{\sigma^2} \mathbf X^T \mathbf y \right) = (\mathbf X^T \mathbf X)^{-1}\mathbf X^T \mathbf y\]

  • OLS is the solution when we have no prior info!

  • This also holds when the number of training samples goes to \(\infty\)

Bayesian Linear Regression

Implication 2:

Let’s set \(\boldsymbol \mu_0 = \mathbf 0\) and \(\boldsymbol \Sigma_0 = \tau^2 \mathcal I_P\)

This means that:

\[\hat{\boldsymbol \Sigma} = \left(\frac{1}{\sigma^2} \mathbf X^T \mathbf X + \frac{1}{\tau^2} \mathcal I_P \right)^{-1}\]

\[\hat{\boldsymbol \mu} = \hat{\boldsymbol \Sigma}\left(\frac{1}{\sigma^2} \mathbf X^T \mathbf y \right)\]

since the prior mean term vanishes when \(\boldsymbol \mu_0 = \mathbf 0\).

Bayesian Linear Regression

Let \(\lambda = \frac{\sigma^2}{\tau^2}\). We can rearrange the posterior mean to be:

\[\hat{\boldsymbol \mu} = (\mathbf X^T \mathbf X + \lambda \mathcal I_P)^{-1} \mathbf X^T \mathbf y\]

which is the analytical solution for Ridge regression

Regularization with the L2 norm is Bayesian regression with a normal prior on the coefficients centered at 0!
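We can verify this equivalence numerically: the conjugate posterior mean with a zero-mean \(\tau^2 \mathcal I_P\) prior equals the ridge solution with \(\lambda = \sigma^2 / \tau^2\) (the data and variance values below are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(2)
N, P = 100, 4
X = rng.normal(size=(N, P))
y = rng.normal(size=N)
sigma2, tau2 = 1.0, 0.5           # assumed noise and prior variances

# Posterior mean with mu0 = 0 and Sigma0 = tau2 * I ...
Sigma_hat = np.linalg.inv(X.T @ X / sigma2 + np.eye(P) / tau2)
mu_hat = Sigma_hat @ (X.T @ y / sigma2)

# ... equals the ridge solution with lambda = sigma2 / tau2.
lam = sigma2 / tau2
ridge = np.linalg.solve(X.T @ X + lam * np.eye(P), X.T @ y)
assert np.allclose(mu_hat, ridge)
```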

Bayesian Linear Regression

Regularization methods are us setting up desired or believed values for the regression coefficients

  • We want the regression coefficients to be closer to zero to better generalize!

The L2 penalty (therefore weight decay) is a Bayesian solution to the generalization problem

  • Inject our prior desire to find a solution set that is simpler (e.g. lower norm of the coefficients)

The strength of this regularization is then the level at which we want to simplify our results

  • Set this in accordance with our goal to generalize as well as possible

Bayesian Linear Regression

The Bayesian approach also justifies letting the model be more complex as \(N\) grows

  • Our prior desire for simplicity doesn’t matter as much when \(N\) gets large since the regularized solution converges to the MLE solution via ERM with no penalties.

I find this to be a satisfying justification for regularization!

Bayesian Machine Learning

Bayesian methods are regularized by construction



Bayesian Machine Learning

The prior variance serves the same purpose as \(\lambda\) in regularized methods!

  • Set the prior variance in accordance with how simple we would like our model to be

  • Choose the prior variance in such a way that we minimize generalization error!

Bayesian machine learning is regularized machine learning

  • Simplicity reigns supreme

  • Allow models to grow more complex as \(N\) goes to \(\infty\)

  • The prior doesn’t matter so much as \(N\) gets large

Bayesian Machine Learning

Aside from the regularization benefits, it also allows us to reasonably estimate coefficients when \(P \approx N\) or \(P > N\).

\[\hat{\boldsymbol \mu} = \left(\frac{1}{\sigma^2} \mathbf X^T \mathbf X + \frac{1}{\tau^2} \mathcal I_P \right)^{-1}\left(\frac{1}{\sigma^2} \mathbf X^T \mathbf y \right)\]

  • When \(P > N\), \(\mathbf X^T \mathbf X\) is not invertible as the determinant is zero

  • Adding a positive constant to the diagonal makes the matrix invertible

We can interpret this in terms of the prior:

  • If \(\mathbf X^T \mathbf y\) is large, let the data tell us the coefficient value

  • If \(\mathbf X^T \mathbf y\) is smaller, default back to the prior and say that the coefficient is essentially zero

  • In an underdetermined system, just default back to zero for any excess coefficients.
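A quick numeric illustration of the invertibility point (the dimensions and \(\lambda\) are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
N, P = 10, 25                 # more features than observations
X = rng.normal(size=(N, P))
y = rng.normal(size=N)

XtX = X.T @ X
# rank(X^T X) <= N < P, so the matrix is singular and OLS is undefined.
assert np.linalg.matrix_rank(XtX) <= N

# Adding a positive constant to the diagonal makes it invertible.
lam = 1.0
XtX_reg = XtX + lam * np.eye(P)
assert np.linalg.matrix_rank(XtX_reg) == P
beta = np.linalg.solve(XtX_reg, X.T @ y)   # now well-defined
```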

Bayesian Machine Learning

Bayesian methods aren’t just a theoretical justification for regularization methods.

  • Certain methods are more easily trained using Bayesian methods

Gaussian Processes for really curvy functional forms

Hierarchical Models (coefficients are nested due to cluster structure)

Relevance Vector Machines (an SVM variant that allows one to skip cross validation in selecting the cost)

Bayesian Machine Learning

The really common one:

Suppose we have a feature matrix, \(\mathbf X\), that is a \(N \times P\) matrix.

We believe that \(\mathbf X\) can be represented in a lower-dimensional space by latent variables \(\mathbf z_i \in \mathbb R^{M}\) where \(M \ll P\).

PCA:

\[f(\mathbf X) \approx f(\mathbf Z) \text{ ; } \mathbf x_i \approx \boldsymbol \Lambda^T \mathbf z_i\]

Bayesian Machine Learning

Maybe the mapping is nonlinear

\[f(\mathbf X) \approx f(\mathbf Z) \text{ ; } \mathbf x_i = g(\mathbf z_i)\]

Then the goal is to find

\[g(\mathbf z_i)\]

that closely approximates \(\mathbf x_i\).

This general problem is called an autoencoder

Summarizing Posteriors

We know that the posterior for Bayesian linear regression is a multivariate normal distribution.

But we often want to present a summary of the different coefficients that is more easily interpreted.

Two common quantities:

  • The maximum a posteriori (MAP) estimate

  • A 95% credible interval

Summarizing Posteriors

The MAP estimate

Given a posterior \(f(\boldsymbol \theta | \mathcal D)\), the MAP estimate is the point over the unknowns associated with the highest density

\[\hat{\boldsymbol \theta} = \underset{\boldsymbol \theta}{\text{argmax }} f(\boldsymbol \theta | \mathcal D)\]

  • Given the posterior, what is the most likely value of the unknowns?

Summarizing Posteriors

For normal posteriors, this is easy as the highest density point is the mean:

\[\hat{\boldsymbol \beta} = \left(\frac{1}{\sigma^2} \mathbf X^T \mathbf X + \boldsymbol \Sigma_0^{-1} \right)^{-1}\left(\frac{1}{\sigma^2} \mathbf X^T \mathbf y + \boldsymbol \Sigma^{-1}_0 \boldsymbol \mu_0 \right)\]

  • The MAP estimate can be used in a more novel way (more soon)

Summarizing Posteriors

We also want to summarize the uncertainty surrounding our MAP estimates

This can always be determined by our posterior variance, \(\hat{\boldsymbol \Sigma}\)

Instead of the posterior variance, we might want to present intervals around the MAP estimates of a specific size

Bayesian 95% Credible Intervals

A region of the posterior that includes 95% of the posterior mass. Conditioned on the data and our prior, a 95% chance that the unknowns lie in this region

  • How we always want to interpret the confidence interval!

Summarizing Posteriors

For multivariate normals, the marginal posterior and corresponding marginal 95% HPD intervals are easy to compute.

Marginals:

The marginal distribution for each element of a random vector that follows a multivariate normal distribution follows a normal distribution with mean \(\hat{\boldsymbol \mu}_j\) and variance \(\hat{\boldsymbol \Sigma}_{j,j}\)

Summarizing Posteriors

Marginal 95% HPD:

Since the normal distribution is symmetric around the mean, the 95% HPD is then:

\[\hat{\boldsymbol \mu}_j \pm 1.96 \sqrt{\hat{\boldsymbol \Sigma}_{j,j}}\]

  • Letting \(\boldsymbol \Sigma_0\) be diffuse, this is equivalent to your standard ol’ confidence interval

  • All these years, you could’ve been interpreting confidence intervals in the way that you want…
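A small helper that computes these marginal intervals from the posterior mean and covariance (the posterior values plugged in below are hypothetical):

```python
import numpy as np

def marginal_intervals(mu_hat, Sigma_hat, z=1.96):
    """Marginal 95% credible intervals from an MVN posterior.

    Each coefficient's marginal is N(mu_hat[j], Sigma_hat[j, j]), so the
    interval is mu_hat[j] +/- 1.96 * sqrt(Sigma_hat[j, j]).
    """
    sd = np.sqrt(np.diag(Sigma_hat))
    return np.column_stack([mu_hat - z * sd, mu_hat + z * sd])

# Hypothetical posterior for two coefficients:
mu_hat = np.array([1.0, -0.5])
Sigma_hat = np.array([[0.04, 0.0], [0.0, 0.09]])
ci = marginal_intervals(mu_hat, Sigma_hat)
assert np.allclose(ci[0], [1.0 - 1.96 * 0.2, 1.0 + 1.96 * 0.2])
```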

MAP Approximations

Sometimes we can find the posterior via analytical methods

  • Kernel matching for the Normal/Normal model

What if our prior/likelihood doesn’t admit this kind of simplification?

As an example of a prior that breaks conjugacy, consider the Laplace distribution:

\[f(x | \mu , b) = \frac{1}{2b} \exp \left[- \frac{|x - \mu|}{b} \right]\]
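A quick comparison of the Laplace and normal densities at matched variance, using the fact that a Laplace with scale \(b\) has variance \(2b^2\):

```python
import numpy as np
from scipy.stats import laplace, norm

# A Laplace(0, b) has variance 2 b^2, so b = 1/sqrt(2) matches the
# unit-variance standard normal.
b = 1.0 / np.sqrt(2.0)
x = np.array([0.0, 1.0, 3.0])
lap = laplace.pdf(x, loc=0.0, scale=b)
gau = norm.pdf(x)

assert lap[0] > gau[0]    # sharper peak at zero
assert lap[2] > gau[2]    # heavier tails
```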

MAP Approximations

The Laplace distribution is sharper at the peak than the normal distribution.

Maybe we want to define a regression model with independent Laplace priors on the coefficients!

Let’s make the prior over our regression coefficients independent Laplace priors:

\[f(\boldsymbol \beta) = \prod \limits_{j = 1}^P \frac{1}{2b_{j0}} \exp \left[- \frac{|\beta_j - \mu_{j0}|}{b_{j0}} \right]\]

Then the posterior using normal errors with constant variance becomes:

\[ f(\boldsymbol \beta | \mathbf X , \mathbf y , \sigma^2) = \frac{\prod \limits_{i = 1}^N \mathcal N(y_i | \mathbf x_i^T \boldsymbol \beta , \sigma^2)\prod \limits_{j = 1}^P \frac{1}{2b_{j0}} \exp \left[- \frac{|\beta_j - \mu_{j0}|}{b_{j0}} \right]}{\int \limits_{\boldsymbol \beta} \prod \limits_{i = 1}^N \mathcal N(y_i | \mathbf x_i^T \boldsymbol \beta , \sigma^2)\prod \limits_{j = 1}^P \frac{1}{2b_{j0}} \exp \left[- \frac{|\beta_j - \mu_{j0}|}{b_{j0}} \right] d \boldsymbol \beta} \]

MAP Approximations

Our posterior isn’t going to simplify!

  • Squares and absolute values don’t mix…

The analytical approach:

\[\text{Posterior} \rightarrow \text{MAP} \rightarrow \text{Uncertainty}\]

The MAP approach:

\[\text{MAP} \rightarrow \text{Uncertainty} \rightarrow \text{Posterior}\]

MAP Approximations

Find the posterior summary quantities and then approximate the posterior.

  • This will work quite well in a lot of scenarios!

  • Will also be viable and quick when other approaches won’t.

Specifically, find the multivariate normal distribution over the unknowns that most closely approximates the posterior.

  • Minimize the distance between the true posterior and the approximate one.

MAP Approximations

A result that I won't prove - the Bernstein-von Mises theorem.

BVM states:

Under some regularity conditions on the likelihood and prior, if the log of the unnormalized posterior is continuous and twice differentiable, then the posterior converges in distribution to a multivariate normal distribution with mean:

\[\hat{\boldsymbol \mu} = \underset{\boldsymbol \Theta}{\text{argmax }}\log f(\boldsymbol \Theta | \mathbf X)\]

and covariance matrix:

\[\hat{\boldsymbol \Sigma} = -\left[ \frac{\partial^2 \log f(\hat{\boldsymbol \mu} | \mathbf X)}{\partial \boldsymbol \Theta \, \partial \boldsymbol \Theta^T} \right]^{-1} \]

as \(N \to \infty\).

  • A sort of Bayesian central limit theorem!
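The recipe suggested by BVM - maximize the log posterior, then invert the negative Hessian at the mode - can be sketched generically. Here it is applied to the normal-mean model from earlier, where the exact conjugate answer is available for comparison (the data and prior values are illustrative assumptions):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
sigma2, mu0, sigma2_0 = 1.0, 0.0, 4.0
y = rng.normal(loc=1.5, size=30)

def neg_log_post(theta):
    """Negative unnormalized log posterior for the normal-mean model."""
    mu = theta[0]
    return (np.sum((y - mu) ** 2) / (2 * sigma2)
            + (mu - mu0) ** 2 / (2 * sigma2_0))

# Step 1: find the mode numerically.
res = minimize(neg_log_post, x0=np.array([0.0]))
mu_map = res.x[0]

# Step 2: finite-difference second derivative at the mode, then invert.
h = 1e-4
hess = (neg_log_post([mu_map + h]) - 2 * neg_log_post([mu_map])
        + neg_log_post([mu_map - h])) / h**2
var_laplace = 1.0 / hess

# Exact conjugate answer for comparison:
var_exact = 1.0 / (len(y) / sigma2 + 1.0 / sigma2_0)
mu_exact = var_exact * (y.sum() / sigma2 + mu0 / sigma2_0)
assert np.isclose(mu_map, mu_exact, atol=1e-4)
assert np.isclose(var_laplace, var_exact, rtol=1e-3)
```

Because this log posterior is exactly quadratic, the approximation here is exact up to numerical error; for non-quadratic posteriors it is only as good as the posterior is normal.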

MAP Approximations

Additionally, this approximation is the multivariate normal approximation that minimizes the Kullback-Leibler divergence between the approximating MVN and the true posterior distribution.

Let \(P = \mathcal N(\boldsymbol \Theta | \hat{\boldsymbol \mu}, \hat{\boldsymbol \Sigma})\) and let \(Q = f(\boldsymbol \Theta | \mathcal D)\). Define the KL divergence as:

\[D_{KL}(P || Q) = \int \limits_\mathbf x P(\mathbf x) \log \frac{P(\mathbf x)}{Q(\mathbf x)} d\mathbf x\]

  • When \(P(\mathbf x) = Q(\mathbf x) \forall \mathbf x \in \mathbb R^P\), this “distance” evaluates to zero.

  • Grows with the difference between \(P(\mathbf x)\) and \(Q(\mathbf x)\)

  • The MAP approximation is the best we can do with a MVN regardless of sample size!
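Since the KL divergence is an expectation under \(P\), it is easy to estimate by Monte Carlo. A sketch for two univariate normals, checked against the known closed form (the parameter values are arbitrary):

```python
import numpy as np
from scipy.stats import norm

# Monte Carlo estimate of D_KL(P || Q) = E_P[log P(x) - log Q(x)].
rng = np.random.default_rng(5)
mu1, s1 = 0.0, 1.0        # P
mu2, s2 = 0.5, 1.5        # Q

x = rng.normal(mu1, s1, size=200_000)
kl_mc = np.mean(norm.logpdf(x, mu1, s1) - norm.logpdf(x, mu2, s2))

# Closed form for KL between two univariate normals.
kl_exact = (np.log(s2 / s1)
            + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2)
            - 0.5)
assert np.isclose(kl_mc, kl_exact, atol=0.01)
```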

MAP Approximations

Let’s do a MAP approximation for the MVN/MVN linear regression case.

Our (unnormalized) posterior is:

\[f(\boldsymbol \beta | \mathbf X , \mathbf y, \sigma^2) \propto \mathcal N_N(\mathbf y | \mathbf X \boldsymbol \beta , \sigma^2 \mathcal I_N) \mathcal N_P(\boldsymbol \beta | \boldsymbol \mu_{0} , \boldsymbol \Sigma_0)\]

With the formulas substituted and terms that don't involve \(\boldsymbol \beta\) dropped:

\[ \begin{split} f(\boldsymbol \beta | \mathbf X , \mathbf y , \sigma^2) \propto & \exp \bigg[-\frac{1}{2} (\mathbf y - \mathbf X \boldsymbol \beta)^T \frac{1}{\sigma^2} \mathcal I_N (\mathbf y - \mathbf X \boldsymbol \beta) \bigg] \\ & \exp \bigg[-\frac{1}{2} (\boldsymbol \beta - \boldsymbol \mu_0)^T \boldsymbol \Sigma_0^{-1} (\boldsymbol \beta - \boldsymbol \mu_0) \bigg] \end{split} \]

MAP Approximations

To make this more applicable, let’s set \(\boldsymbol \mu_0 = \mathbf 0\) and \(\boldsymbol \Sigma_0 = \tau^2 \mathcal I_P\).

\[ \begin{split} f(\boldsymbol \beta | \mathbf X , \mathbf y , \sigma^2) \propto & \exp \bigg[-\frac{1}{2 \sigma^2} (\mathbf y - \mathbf X \boldsymbol \beta)^T (\mathbf y - \mathbf X \boldsymbol \beta) -\frac{1}{2\tau^2} \boldsymbol \beta^T \boldsymbol \beta \bigg] \end{split} \]

First step: take the log

\[ \begin{split} \log f(\boldsymbol \beta | \mathbf X , \mathbf y , \sigma^2) \propto & -\frac{1}{2 \sigma^2} (\mathbf y - \mathbf X \boldsymbol \beta)^T (\mathbf y - \mathbf X \boldsymbol \beta) -\frac{1}{2\tau^2} \boldsymbol \beta^T \boldsymbol \beta \end{split} \]

MAP Approximations

To make things even easier, let’s let \(\lambda = \frac{\sigma^2}{\tau^2}\):

\[\log f(\boldsymbol \beta | \mathbf X , \mathbf y , \sigma^2) \propto -\frac{1}{2} \left[ (\mathbf y - \mathbf X \boldsymbol \beta)^T (\mathbf y - \mathbf X \boldsymbol \beta) + \lambda \boldsymbol \beta^T \boldsymbol \beta \right] \]

  • Look familiar?

Second step: take the derivative w.r.t. \(\boldsymbol \beta\):

\[ \frac{d}{d \boldsymbol \beta} \log f(\boldsymbol \beta | \mathbf X , \mathbf y , \sigma^2) \propto \frac{d}{d \boldsymbol \beta} -\frac{1}{2} \left[ \mathbf y^T \mathbf y - 2 \boldsymbol \beta^T \mathbf X^T \mathbf y + \boldsymbol \beta^T \left[\mathbf X^T \mathbf X + \lambda \mathcal I_P \right] \boldsymbol \beta \right] \]

MAP Approximations

\[ \frac{d}{d \boldsymbol \beta} \log f(\boldsymbol \beta | \mathbf X , \mathbf y , \sigma^2) \propto -\frac{1}{2} \left[ - 2 \mathbf X^T \mathbf y + 2 \left[\mathbf X^T \mathbf X + \lambda \mathcal I_P \right] \boldsymbol \beta \right] \]

Setting to zero and solving for \(\boldsymbol \beta\):

\[\hat{\boldsymbol \beta} = (\mathbf X^T \mathbf X + \lambda \mathcal I_P)^{-1} \mathbf X^T \mathbf y\]

which matches our posterior mean!

MAP Approximations

The second part is to get the covariance matrix for the MVN approximation:

\[\hat{\boldsymbol \Sigma} = -\left[ \frac{\partial^2 \log f(\hat{\boldsymbol \mu} | \mathbf X)}{\partial \boldsymbol \Theta \, \partial \boldsymbol \Theta^T} \right]^{-1} \]

This requires finding the Hessian and evaluating it at the MAP estimate

  • This covariance estimation method is called the Laplace approximation

  • It is also (roughly) equivalent to the Fisher information approach for MLE

  • This approach is often referred to as Bayesian MLE

MAP Approximations

\[ \frac{d}{d \boldsymbol \beta} \log f(\boldsymbol \beta | \mathbf X , \mathbf y , \sigma^2) \propto -\frac{1}{2} \left[ - 2 \mathbf X^T \mathbf y + 2 \left[\mathbf X^T \mathbf X + \lambda \mathcal I_P \right] \boldsymbol \beta \right] \]

\[ \frac{d^2}{d \boldsymbol \beta \, d \boldsymbol \beta^T} \log f(\boldsymbol \beta | \mathbf X , \mathbf y , \sigma^2) \propto - \left[ \mathbf X^T \mathbf X + \lambda \mathcal I_P \right] \]

Taking the negative inverse and restoring the \(\frac{1}{\sigma^2}\) factor that was absorbed into the proportionality:

\[\hat{\boldsymbol \Sigma} = \sigma^2 \left[ \mathbf X^T \mathbf X + \lambda \mathcal I_P \right]^{-1} = \left(\frac{1}{\sigma^2} \mathbf X^T \mathbf X + \frac{1}{\tau^2} \mathcal I_P \right)^{-1}\]

matching the posterior covariance!

MAP Approximations

The MAP approximation plus Laplace approximation yields the exact same posterior when the true posterior is MVN!

  • Closest approximation!

The quality of this approximation is going to depend on the MVN-ness of the actual posterior distribution.

  • The more normal, the better the approximation.

When \(N\) is large and the number of parameters isn’t too large, BVM states that posteriors become MVN!

  • So, most Bayesian machine learning use cases can use this approach without much loss in knowledge.

MAP Approximations

Now, let’s go back to regression with Laplace priors:

\[ f(\boldsymbol \beta | \mathbf X , \mathbf y , \sigma^2) = \frac{\prod \limits_{i = 1}^N \mathcal N(y_i | \mathbf x_i^T \boldsymbol \beta , \sigma^2)\prod \limits_{j = 1}^P \frac{1}{2b_{j0}} \exp \left[- \frac{|\beta_j - \mu_{j0}|}{b_{j0}} \right]}{\int \limits_{\boldsymbol \beta} \prod \limits_{i = 1}^N \mathcal N(y_i | \mathbf x_i^T \boldsymbol \beta , \sigma^2)\prod \limits_{j = 1}^P \frac{1}{2b_{j0}} \exp \left[- \frac{|\beta_j - \mu_{j0}|}{b_{j0}} \right] d \boldsymbol \beta} \]

Let’s look at the unnormalized posterior for a simple linear regression with intercept zero and true coefficient of 0 with no prior, a normal prior, and a Laplace prior.

MAP Approximations

The Laplace prior places a little extra mass at 0 here.

  • Sharp top due to the absolute value

Do you think that the Laplace approximation will work well here?

MAP Approximations

Regardless, let’s find the MAP estimate for this regression.

We’re going to make a simplifying assumption here: \(\mu_{j,0} = 0\) and \(b_{j,0} = b_0\) for all coefficients except for the intercept. Instead, we’re going to place a normal prior on the intercept with mean 0 and variance \(\sigma^2_{a}\). Denote the intercept as \(\alpha\). Then our posterior becomes:

\[ f(\boldsymbol \beta, \alpha | \mathbf X , \mathbf y , \sigma^2) = \frac{\prod \limits_{i = 1}^N \mathcal N(y_i | \mathbf x_i^T \boldsymbol \beta + \alpha, \sigma^2)\prod \limits_{j = 1}^P \frac{1}{2b_{0}} \exp \left[- \frac{|\beta_j|}{b_{0}} \right] \mathcal N(\alpha | 0 , \sigma^2_a)}{\int \limits_{\boldsymbol \beta, \alpha} \prod \limits_{i = 1}^N \mathcal N(y_i | \mathbf x_i^T \boldsymbol \beta + \alpha, \sigma^2)\prod \limits_{j = 1}^P \frac{1}{2b_{0}} \exp \left[- \frac{|\beta_j|}{b_{0}} \right] \mathcal N(\alpha | 0 , \sigma^2_a) d \boldsymbol \beta d \alpha }\]

MAP Approximations

Looking only at the numerator:

\[ \begin{split} f(\boldsymbol \beta, \alpha | \mathbf X , \mathbf y , \sigma^2) \propto \prod \limits_{i = 1}^N \mathcal N(y_i | \mathbf x_i^T \boldsymbol \beta + \alpha, \sigma^2)\prod \limits_{j = 1}^P \frac{1}{2b_{0}} \exp \left[- \frac{|\beta_j|}{b_{0}} \right] \mathcal N(\alpha | 0 , \sigma^2_a) \end{split} \]

The log posterior is then:

\[ \begin{split} \log f(\boldsymbol \beta, \alpha | \mathbf X , \mathbf y , \sigma^2) \propto & \sum \limits_{i = 1}^N -\frac{1}{2} \log(2 \pi \sigma^2) - \\ &\frac{1}{2 \sigma^2} (y_i - \mathbf x_i^T \boldsymbol \beta - \alpha)^T (y_i - \mathbf x_i^T \boldsymbol \beta - \alpha) + \\ & \sum \limits_{j = 1}^P - \log(2 b_0) - \frac{|\beta_j|}{b_0} \\ & -\frac{1}{2} \log(2 \pi \sigma^2_a) - \frac{1}{2 \sigma^2_a} \alpha^T \alpha \end{split} \]

MAP Approximations

Let’s reduce this into a more workable form by getting rid of any constants and simplifying our sum terms:

\[ \begin{split} \log f(\boldsymbol \beta, \alpha | \mathbf X , \mathbf y , \sigma^2) \propto -\frac{1}{2 \sigma^2} \sum \limits_{i = 1}^N (y_i - \mathbf x_i^T \boldsymbol \beta - \alpha)^T (y_i - \mathbf x_i^T \boldsymbol \beta - \alpha) - & \\ \frac{1}{b_0} \sum \limits_{j = 1}^P |\beta_j| - \frac{1}{2 \sigma^2_a} \alpha^T \alpha \end{split} \]

\[ \begin{split} \log f(\boldsymbol \beta, \alpha | \mathbf X , \mathbf y , \sigma^2) \propto & - \bigg[ \frac{1}{2 \sigma^2} \sum \limits_{i = 1}^N (y_i - \mathbf x_i^T \boldsymbol \beta - \alpha)^T (y_i - \mathbf x_i^T \boldsymbol \beta - \alpha) + \\ & \frac{1}{b_0} \sum \limits_{j = 1}^P |\beta_j| + \frac{1}{2 \sigma^2_a} \alpha^T \alpha \bigg] \end{split} \]

MAP Approximations

Finally, multiplying everything by \(2 \sigma^2\) (which we can do because everything is just proportional) and assuming \(\sigma^2_a\) is large (a weak prior):

\[ \begin{split} \log f(\boldsymbol \beta, \alpha | \mathbf X , \mathbf y , \sigma^2) \propto & - \bigg[ \sum \limits_{i = 1}^N (y_i - \mathbf x_i^T \boldsymbol \beta - \alpha)^2 + \frac{2 \sigma^2}{b_0} \sum \limits_{j = 1}^P |\beta_j| \bigg] \end{split} \]

Recognize this equation?

MAP Approximations

The interior of this log posterior is a familiar form: the LASSO!

  • By placing independent Laplace priors on the coefficients, we are able to recover the sum of squared errors plus an L1 penalty term.

  • The intercept term is treated differently so it is not shrunk towards zero.

Okay. Let’s try to find the MAP estimate.

MAP Approximations

Disappointment!

  • There is no closed form solution for the coefficients.

  • That pesky absolute value ruins everything…

  • Bummer.

However, we can still derive a gradient for these coefficients

  • We just can’t set it to zero and solve in closed form.

MAP Approximations

\[\nabla_{\alpha} = \frac{1}{N} \sum \limits_{i = 1}^N (y_i - \mathbf x_i^T \boldsymbol \beta - \alpha)\]

i.e., the average residual given the current values of \(\boldsymbol \beta\) and \(\alpha\).

\[ \nabla_{\boldsymbol \beta} = - \frac{\lambda}{N} \text{sign}(\boldsymbol \beta) + \frac{1}{N} (\mathbf y - \mathbf X \boldsymbol \beta - \tilde{\boldsymbol \alpha})^T\mathbf X\]

where \(\tilde{\boldsymbol \alpha}\) is an \(N\)-vector with the intercept value repeated \(N\) times and \(\lambda = \frac{2\sigma^2}{b_0}\). Note that \(\text{sign}(\boldsymbol \beta)\) is only a subgradient of the L1 term, since the absolute value is not differentiable at zero.

Given this gradient, we can use gradient-based optimizers to find a local maximum of the Bayesian LASSO objective function and generate a MAP estimate.
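A minimal (sub)gradient ascent sketch of this idea — the function name `lasso_map`, the step size, and the iteration count are all illustrative choices, and `sign(0)` is taken as 0, so this is a crude subgradient method rather than an exact solver:

```python
import numpy as np

def lasso_map(X, y, lam, lr=0.01, iters=5000):
    """(Sub)gradient ascent on the Bayesian LASSO log posterior.

    Follows the gradients above: the intercept step is the average
    residual; the coefficient step is the scaled residual-feature
    product minus the L1 subgradient (lam / N) * sign(beta).
    """
    N, P = X.shape
    beta = np.zeros(P)
    alpha = 0.0
    for _ in range(iters):
        resid = y - X @ beta - alpha                          # current residuals
        grad_beta = (resid @ X) / N - (lam / N) * np.sign(beta)
        alpha += lr * resid.mean()                            # gradient w.r.t. alpha
        beta += lr * grad_beta
    return beta, alpha
```

In practice, coordinate descent or proximal methods (as in glmnet or scikit-learn’s `Lasso`) converge faster and produce exact zeros; the subgradient version only hovers near zero for shrunk coefficients.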

MAP Approximations

We can also evaluate the Hessian for this fit to get a MAP approximation for the Bayesian LASSO

  • Not done frequently because normality of the posterior is achieved when \(N\) is quite large and the optimal \(\lambda\) is quite small

  • In other words, it doesn’t work well except in cases where the fit is essentially equivalent to OLS.

MAP Approximations

MAP approximations provide a tractable way to do Bayesian inference for different models where a conjugate prior form is not available.

  • Correct in the limit

  • Close enough in a lot of different scenarios

Basis for a common machine learning technique called variational inference

  • Need one more piece…

Marginal Likelihood

Recall that the posterior for the unknowns given the knowns and a prior can be written as:

\[ f(\boldsymbol \theta | \mathbf X) = \frac{f(\mathbf X | \boldsymbol \theta)f(\boldsymbol \theta)}{f(\mathbf X)} = \frac{f(\mathbf X | \boldsymbol \theta)f(\boldsymbol \theta)}{\int \limits_{\boldsymbol \theta} f(\mathbf X | \boldsymbol \theta)f(\boldsymbol \theta) d \boldsymbol \theta} \]

  • The numerator is random with respect to the unknowns

  • The denominator is a constant since it is computed marginalizing over the unknowns.

  • i.e., the marginal likelihood of the data

Marginal Likelihood

The denominator has been written in a way to reduce the complexity of the term. For regression exercises treating \(\sigma^2\) as known, this denominator is really:

\[f(\mathbf y | \mathbf X , \sigma^2, \boldsymbol \Pi)\]

where \(\boldsymbol \Pi\) is the collection of prior hyperparameters (e.g. the prior mean and covariance).

This quantity tells us the likelihood of observing the outcomes given the fixed quantities, marginalizing over the unknowns.

Marginal Likelihood

Let’s think about what this means

  • Given the model choice and the prior over the unknown parameters, what is the probability that we would observe these outcomes?

  • Larger values correspond to good fits to the data.

  • Smaller values indicate that the data isn’t well explained by the model.

Marginal Likelihood

Key point: the marginal likelihood is a Bayesian measure of the generalization error for a model!

  • If the prior encodes the probability weighted viable values of the unknowns, then we’re assessing how well the model fits the data w.r.t all values implied by the prior

  • The more overfit a model is, the more the “quality” of the fit depends on having just the right coefficients

  • Little changes in the unknowns lead to large drops in fit!

Marginal Likelihood

Can also show that the marginal likelihood is roughly equivalent to leave one out cross validation

\[LOOCV(\mathbf y) = \frac{1}{N} \sum \limits_{i = 1}^N \log f(y_i | \mathbf x_i , \hat{\boldsymbol \theta}(\mathbf y^{(-i)},\mathbf X^{(-i)}))\]

Log marginal likelihood:

\[\log f(\mathbf y | \mathbf X , \boldsymbol \Pi)\]

  • We don’t want to make the assumption that each \(y_i\) is an i.i.d. draw from the distribution of the knowns.

  • In the regression case, we construct the problem such that \(f(y_i | \mathbf x_i , \boldsymbol \beta, \sigma^2)\) is independent of \(f(y_j | \mathbf x_j , \boldsymbol \beta, \sigma^2)\) conditional on the coefficients.

  • However, it is not true that \(f(y_i | \mathbf x_i , \sigma^2)\) is independent of \(f(y_j | \mathbf x_j , \sigma^2)\): once the coefficients are marginalized out, the outcomes share a common DGP relying on \(\boldsymbol \beta\).

Marginal Likelihood

Always true, though:

\[\log f(\mathbf y | \mathbf X , \boldsymbol \Pi) = \sum \limits_{i = 1}^N \log f(y_i | \mathbf y_{1:i-1}, \mathbf X, \boldsymbol \Pi)\]

  • The two are almost equivalent!

  • Difference is LOOCV requires \(N\) computations of the model while the marginal likelihood is computed once
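By the chain rule, the log marginal likelihood decomposes exactly into a sum of one-step-ahead log predictive densities, each conditioning only on the observations seen so far. For the conjugate MVN/MVN model this identity can be checked numerically; a sketch assuming known \(\sigma^2\) and simulated data (all names and simulation settings are illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

rng = np.random.default_rng(1)
N, P, sigma2 = 50, 3, 1.0
X = rng.normal(size=(N, P))
y = X @ rng.normal(size=P) + rng.normal(size=N)
mu0, Sigma0 = np.zeros(P), 4.0 * np.eye(P)

# One-shot evaluation: log N(y | X mu0, sigma^2 I + X Sigma0 X^T)
lml = multivariate_normal(X @ mu0, sigma2 * np.eye(N) + X @ Sigma0 @ X.T).logpdf(y)

# Sequential evaluation: sum of one-step-ahead log predictive densities,
# updating the conjugate posterior one observation at a time
mu, Sigma, total = mu0, Sigma0, 0.0
for xi, yi in zip(X, y):
    pred_var = sigma2 + xi @ Sigma @ xi          # predictive variance at x_i
    total += norm(xi @ mu, np.sqrt(pred_var)).logpdf(yi)
    prec = np.linalg.inv(Sigma)                  # rank-one conjugate update
    Sigma = np.linalg.inv(prec + np.outer(xi, xi) / sigma2)
    mu = Sigma @ (prec @ mu + xi * yi / sigma2)
```

The two evaluations agree to numerical precision, which is why the marginal likelihood behaves like a built-in predictive (cross-validation-like) score computed in a single pass.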

This is great! Why is this the first time you’re hearing about it?

Marginal Likelihood

It can be quite difficult to compute. Generally, the integral:

\[\int \limits_{\boldsymbol \theta} f(\mathbf X | \boldsymbol \theta)f(\boldsymbol \theta) d \boldsymbol \theta\]

is intractable and cannot be computed analytically.

  • However, there are some cases where we can compute this integral!

  • One is linear regression with a conjugate normal prior.

Marginal Likelihood

Sparing the gory details of this derivation (see Chapter 2 of PML2 for these), the marginal likelihood for a regression model with a MVN likelihood and MVN prior is:

\[ \begin{split} \int \limits_{\boldsymbol \beta} & \mathcal N_N(\mathbf y | \mathbf X \boldsymbol \beta , \sigma^2 \mathcal I_N) \mathcal N_P(\boldsymbol \beta | \boldsymbol \mu_0 , \boldsymbol \Sigma_0) d \boldsymbol \beta = \\ & \mathcal N_N(\mathbf y | \mathbf X \boldsymbol \mu_0 , \sigma^2 \mathcal I_N + \mathbf X \boldsymbol \Sigma_0 \mathbf X^T) \end{split} \]

  • This can easily be computed since all of the inputs are known
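A sketch of evaluating this closed form directly, assuming SciPy is available (`log_marginal_likelihood` is a hypothetical helper name):

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_marginal_likelihood(X, y, mu0, Sigma0, sigma2):
    """Log density of y under N_N(X mu0, sigma^2 I_N + X Sigma0 X^T)."""
    N = X.shape[0]
    mean = X @ mu0
    cov = sigma2 * np.eye(N) + X @ Sigma0 @ X.T
    return multivariate_normal(mean=mean, cov=cov).logpdf(y)
```

As a sanity check, a prior centered near the coefficients that generated the data yields a higher marginal likelihood than one centered far away.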

Marginal Likelihood

An additional case where we can compute the marginal likelihood is when we use the MAP estimate with Laplace approximation to get an approximate posterior.

\[\int f(\mathbf X | \boldsymbol \theta)f(\boldsymbol \theta) d\boldsymbol \theta \approx f(\mathbf y | \mathbf X, \hat{\boldsymbol \mu}) f(\hat{\boldsymbol \mu}) (2 \pi)^{P/2} \text{det}\left[\hat{\boldsymbol \Sigma} \right]^{1/2} N^{-P/2}\]

  • The denominator when the posterior is MVN is pretty easy to compute!

Marginal Likelihood

When can we use this?

  • Generally, Bayesian methods try to approximate this high dimensional integral using Monte Carlo methods (see chapter 3 of my dissertation)

  • Selecting \(\lambda\) for ridge regression

  • The ARD prior (outlined in the lecture notes)

Marginal Likelihood

Recall that Ridge regression can be expressed as a Bayesian problem:

  • Multivariate normal likelihood with mean \(\mathbf X \boldsymbol \beta\) and covariance matrix \(\sigma^2 \mathcal I_N\)

  • Multivariate normal prior with mean \(\mathbf 0\) and covariance matrix \(\tau^2 \mathcal I_P\)

We can always rescale the problem such that the likelihood has covariance \(\mathcal I_N\) and the prior has covariance \(\frac{\tau^2}{\sigma^2} \mathcal I_P\).

  • Let \(\lambda = \frac{\sigma^2}{\tau^2}\)

The marginal likelihood given \(\lambda\) is then:

\[f(\mathbf y | \mathbf X, \lambda) = \mathcal N_N(\mathbf y | \mathbf 0 , \mathcal I_N + \lambda^{-1} \mathbf X \mathbf X^T)\]

  • Find the \(\lambda\) that maximizes this quantity
  • A model-specific quantity that we would like to learn after marginalizing over the coefficients
  • e.g. fix it at the maximizing value (empirical Bayes)

Marginal Likelihood

Goal: find \(\hat{\lambda}\) such that:

\[ \hat{\lambda} = \underset{\lambda}{\text{argmax }} f(\mathbf y | \mathbf X, \lambda) \]

  • Integrating over the unknowns/learnable parameters, find the value of the nuisance parameter that maximizes the marginal likelihood

  • Plug this in or jointly learn the corresponding parameter values to fully specify the posterior!
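A grid-search sketch of this empirical Bayes procedure, assuming the rescaled parameterization with \(\lambda = \frac{\sigma^2}{\tau^2}\), so the prior covariance enters the marginal covariance as \(\lambda^{-1} \mathbf X \mathbf X^T\) (names and simulation settings are illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal

def ridge_log_evidence(X, y, lam):
    """Log marginal likelihood of the rescaled ridge model:
    y ~ N(0, I_N + (1 / lam) X X^T), with lam = sigma^2 / tau^2."""
    N = X.shape[0]
    cov = np.eye(N) + (1.0 / lam) * (X @ X.T)
    return multivariate_normal(np.zeros(N), cov).logpdf(y)

# Simulated example: sigma^2 = 1, tau^2 = 0.25, so the "true" lam is 4
rng = np.random.default_rng(0)
N, P = 200, 5
X = rng.normal(size=(N, P))
y = X @ rng.normal(scale=0.5, size=P) + rng.normal(size=N)

# Empirical Bayes: maximize the evidence over a grid of lam values
lams = np.logspace(-3, 3, 61)
lam_hat = lams[np.argmax([ridge_log_evidence(X, y, l) for l in lams])]
```

Note how the evidence automatically penalizes both extremes: a huge \(\lambda\) (prior collapsed to zero) cannot explain the signal, while a tiny \(\lambda\) (diffuse prior) pays a large complexity penalty through the determinant term.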

Wrap-Up

Bayesian machine learning is a common framework

  • You’ll run into it a lot!

This is just a fast overview of important topics

  • Way more out there!

PML1 and PML2 are great references for this material

  • We’ll discuss more as needed

Next Time

Variational Inference for Autoencoders!

Posterior Predictives

One other quick bite:

The posterior is a distribution over the unknowns. If our prediction at any feature vector \(\mathbf x_0\) is a function of these unknowns, then the prediction also follows a distribution.

Posterior predictive distribution

\[f(y_0 | \mathbf x_0 , \mathbf X , \mathbf y) = \int \limits_{\boldsymbol \theta} f(y_0 | \mathbf x_0, \boldsymbol \theta) f(\boldsymbol \theta | \mathcal D) d \boldsymbol \theta\]

  • What is the distribution of our prediction given the prior beliefs and conditioning on the training data?

Posterior Predictives

For the MVN/MVN regression model, the posterior predictive distribution has a closed form:

\[f(y_0 | \mathbf x_0 , \sigma^2, \mathbf X , \mathbf y) = \mathcal N(y_0 | \hat{\boldsymbol \mu}^T \mathbf x_0 , \sigma^2 + \mathbf x_0^T \hat{\boldsymbol \Sigma} \mathbf x_0)\]

  • Can be used to assess the quality of the fit to the training data

  • See 11.7.4 in PML1 for more discussion.

  • Note that this is the MAP estimate of the prediction with variance equal to the error variance plus the model variance (variance over the coefficients) evaluated at the feature vector!
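A sketch of this predictive computation for the conjugate model, deriving the posterior mean and covariance first (the helper name and the weak-prior setup are illustrative):

```python
import numpy as np

def posterior_predictive(x0, X, y, sigma2, mu0, Sigma0):
    """Predictive mean and variance of y0 at x0 for the MVN/MVN model."""
    prior_prec = np.linalg.inv(Sigma0)
    Sigma_hat = np.linalg.inv(prior_prec + (X.T @ X) / sigma2)    # posterior cov
    mu_hat = Sigma_hat @ (prior_prec @ mu0 + (X.T @ y) / sigma2)  # posterior mean
    mean = mu_hat @ x0                      # posterior-mean prediction
    var = sigma2 + x0 @ Sigma_hat @ x0      # error variance + model variance
    return mean, var
```

With a very weak prior, the predictive mean collapses to the OLS prediction, while the variance stays strictly larger than \(\sigma^2\) because of the \(\mathbf x_0^T \hat{\boldsymbol \Sigma} \mathbf x_0\) term.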
